Introduction & Data Exploration

Row

The Problem & Data Collection

The Problem

Climate change has been an ongoing problem, and one of the main factors is carbon dioxide emissions. The problem I want to solve: is it possible to predict the amount of carbon dioxide a state emits with certain predictors? I want to learn what can help us predict carbon dioxide emissions.

The Questions

  1. What variables can I use to predict a state’s carbon dioxide emissions?
  2. What is the best model to predict a state’s carbon dioxide emissions?
  3. What variables can I use to predict whether a state would be considered a “higher emissions” or “low emissions” state based on my current variables?
  4. What is the best model to use to answer question 3?

The Data

This data set has 50 rows and 19 variables. All of the data is from 2020. For this analysis, I will not be using ‘State’ as a variable. I also won’t be using ‘CoalTrans’ as a variable because it is 0 for all states. The data set I’m using is a combination of different data sets. Since there are only 50 rows of data, I will not do training & validation sets for classification models.

The Data

VARIABLES TO PREDICT WITH

  • TotEnergy: total energy consumed (in trillion btu)
  • Coal: energy consumed from coal (in trillion btu)
  • NaturalGas: energy consumed from natural gas (in trillion btu)
  • Petroleum: energy consumed from petroleum (in trillion btu)
  • TotFF: total energy consumed from fossil fuels, sum of Coal, NaturalGas, & Petroleum (in trillion btu)
  • NuclearElectricPower: energy consumed from nuclear electric power
  • RenewableEnergy: energy consumed from renewable energy sources (in trillion btu)
  • Residential: energy consumed by the residential sector (in trillion btu)
  • Commercial: energy consumed by the commercial sector (in trillion btu)
  • Industrial: energy consumed by the industrial sector (in trillion btu)
  • Transportation: energy consumed by the transportation sector (in trillion btu)
  • Pop: population of a state
  • CoalTrans: energy from coal used for transportation (in trillion btu)
  • NaturalGasTrans: energy from natural gas used for transportation (in trillion btu)
  • PetroleumTrans: energy from petroleum used for transportation (in trillion btu)
  • TaxCredit: EV tax credit dummy variable (= 1 if tax credit; 0 otherwise)

VARIABLES WE WANT TO PREDICT

  • TotCO2: total CO2 emissions of a state (in million metric tons)
  • HighLow: 1 if CO2 emissions > 100 million metric tons, 0 otherwise

Row

Summary Stats

    State               TotCO2          HighLow       TotEnergy      
 Length:50          Min.   :  5.40   Min.   :0.00   Min.   :  125.7  
 Class :character   1st Qu.: 39.42   1st Qu.:0.00   1st Qu.:  675.2  
 Mode  :character   Median : 67.05   Median :0.00   Median : 1479.7  
                    Mean   : 91.80   Mean   :0.28   Mean   : 1854.6  
                    3rd Qu.:105.35   3rd Qu.:1.00   3rd Qu.: 2214.6  
                    Max.   :624.00   Max.   :1.00   Max.   :13480.8  
      Coal         NaturalGas       Petroleum          TotFF        
 Min.   :  0.0   Min.   :   0.2   Min.   :  68.4   Min.   :   84.0  
 1st Qu.: 19.3   1st Qu.: 265.5   1st Qu.: 224.0   1st Qu.:  619.7  
 Median :146.1   Median : 364.3   Median : 444.9   Median : 1009.7  
 Mean   :183.7   Mean   : 629.9   Mean   : 646.7   Mean   : 1460.2  
 3rd Qu.:248.1   3rd Qu.: 739.5   3rd Qu.: 682.0   3rd Qu.: 1549.6  
 Max.   :872.8   Max.   :4708.4   Max.   :6185.8   Max.   :11767.1  
 NuclearElectricPower RenewableEnergy    Residential       Commercial    
 Min.   :   0.0       Min.   :   7.70   Min.   :  36.8   Min.   :  25.3  
 1st Qu.:   0.0       1st Qu.:  84.28   1st Qu.: 137.9   1st Qu.: 105.6  
 Median :  89.6       Median : 170.80   Median : 316.1   Median : 234.2  
 Mean   : 165.0       Mean   : 228.16   Mean   : 409.7   Mean   : 332.9  
 3rd Qu.: 300.2       3rd Qu.: 280.07   3rd Qu.: 512.6   3rd Qu.: 403.8  
 Max.   :1046.8       Max.   :1150.20   Max.   :1744.1   Max.   :1630.5  
   Industrial     Transportation        Pop             CoalTrans
 Min.   :  17.7   Min.   :  39.0   Min.   :  577719   Min.   :0  
 1st Qu.: 180.1   1st Qu.: 175.3   1st Qu.: 1871866   1st Qu.:0  
 Median : 379.0   Median : 382.5   Median : 4585405   Median :0  
 Mean   : 625.4   Mean   : 486.6   Mean   : 6622169   Mean   :0  
 3rd Qu.: 573.8   3rd Qu.: 598.8   3rd Qu.: 7576690   3rd Qu.:0  
 Max.   :7265.9   Max.   :2840.2   Max.   :39576757   Max.   :0  
 NaturalGasTrans  PetroleumTrans     TaxCredit   
 Min.   :  0.00   Min.   :  39.0   Min.   :0.00  
 1st Qu.:  5.30   1st Qu.: 169.2   1st Qu.:0.00  
 Median : 11.65   Median : 360.4   Median :0.00  
 Mean   : 21.90   Mean   : 463.6   Mean   :0.34  
 3rd Qu.: 24.85   3rd Qu.: 553.3   3rd Qu.:1.00  
 Max.   :196.10   Max.   :2642.5   Max.   :1.00  

Visualization 1

Response Variables:

Total CO2 Emissions (in million metric tons)

Column

This is a histogram of TotCO2, the total CO2 emissions of each state. Most states fall between 0 and 200 million metric tons of carbon dioxide.

Visualization 2

Column

Response Variables: CO2 Emissions: High (1)/Low(0)

Bar Chart

Column

Box Plot

Row

We can see that the majority of states are considered “Low Emission” states, meaning that they produce less than 100 million metric tons of carbon dioxide. We can also see that “High Emission” states have a larger range of values and include 2 outliers.

Scatter Plots

Row

TotCO2 Analyses

Row

Predict CO2 Emissions Models

I will be using prediction/estimation modeling techniques to predict the amount of CO2 emissions for a state. The first technique I will use is a Multiple Linear Regression. Then I will run a Decision Tree Model and compare them.

Row

Predict CO2 Emissions Model 1 (M1)

For this model, my predictors were: Pop, Coal, NaturalGas, Petroleum, NuclearElectricPower, RenewableEnergy, NaturalGasTrans, PetroleumTrans, & TaxCredit.

Row

Adjusted R-Squared

99.98 %

RMSE

1.47

Row

Regression Output

                     Estimate Std. Error t value Pr(>|t|)
Coal                    0.097      0.001  72.110    0.000
NaturalGas              0.052      0.001  37.045    0.000
Petroleum               0.028      0.001  18.949    0.000
PetroleumTrans          0.032      0.004   7.939    0.000
Pop                     0.000      0.000   4.682    0.000
NaturalGasTrans         0.044      0.016   2.755    0.009
(Intercept)             1.143      0.452   2.527    0.016
RenewableEnergy         0.004      0.001   2.347    0.024
NuclearElectricPower    0.001      0.001   0.661    0.512
TaxCredit              -0.375      0.572  -0.655    0.516

Row

Analysis Summary

After examining this model, some predictors are not important in predicting CO2: NuclearElectricPower and TaxCredit are not significant (p-values of 0.512 and 0.516). A pruned version of the model is therefore created by removing predictors that are not significant.

Row

Predict CO2 Emissions Model 2 (M2)

For this analysis, we will use a pruned Multiple Linear Regression Model. I also removed predictors that I didn’t think were important in predicting CO2 emissions. The transportation predictors with estimates in M1 (NaturalGasTrans & PetroleumTrans) were significant based on their p-values, so I decided to use the Transportation variable as a predictor because it’s the sum of CoalTrans, NaturalGasTrans, & PetroleumTrans. These are the predictors in the final model: Coal, NaturalGas, Petroleum, & Transportation.

Adjusted R-Squared

99.96 %

RMSE

1.84

Row

Regression Output

                     Estimate Std. Error t value Pr(>|t|)
Coal                    0.094      0.001  63.235    0.000
NaturalGas              0.053      0.001  43.755    0.000
Transportation          0.050      0.001  35.613    0.000
Petroleum               0.024      0.001  21.954    0.000
(Intercept)             1.005      0.429   2.344    0.024

Row


Row

TotCO2 Model Comparison

HighLow Analyses

Predict High/Low Emissions Models

I will be using classification modeling techniques to predict if a state would be considered a “high emissions” or “low emissions” state. The first classification technique I will use is a Nominal Logistic Model. Then I will run a Boosted Tree Model and compare the two.

Nominal Logistic Model (M4)


Boosted Tree Model (M5)

HighLow Model Comparison

Conclusion

Questions & Answers:

  1. What is the best model to predict a state’s carbon dioxide emissions?

A multiple linear regression is the best model.

  2. What variables can I use to predict a state’s carbon dioxide emissions?

The predictors used were: Coal, NaturalGas, Petroleum, & Transportation.

  3. What is the best model to use to predict whether a state would be considered a “higher emissions” or “low emissions” state based on my current variables?

The best model to use is a nominal logistic model, but a boosted tree model also works very well.

  4. What are the variables that can be used in the model?

The variables used were: Coal, NaturalGas, Petroleum, Pop, & Transportation.

Overall, all of the models created were useful and could predict/classify a state’s CO2 emissions well. The best models were the Multiple Linear Regression Model 2 and the Nominal Logistic Regression Model. The Multiple Linear Regression Model 1 was also good, but was more complicated. The Boosted Tree Model was almost the same as the Nominal Logistic Model. One thing that surprised me was that I didn’t need/use all of the variables I began with.

---
title: "Carbon Dioxide Emissions Data Analysis Project"
output: 
  flexdashboard::flex_dashboard:              # this is telling r to create a dashboard, also make sure every
    vertical_layout: scroll                   # part works before knitting the file together to make a dashboard
    source_code: embed  
---

-----------------------------------------------------------------------

```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```


```{r load_data}
df <- read_csv("INFO 3200 Official Project Data.csv")
```

Introduction & Data Exploration {data-orientation=rows}
=======================================================================

Row {data-height=800}
-----------------------------------------------------------------------

### The Problem & Data Collection

#### The Problem
Climate change has been an ongoing problem, and one of the main factors is carbon dioxide emissions. The problem I want to solve: is it possible to predict the amount of carbon dioxide a state emits with certain predictors? I want to learn what can help us predict carbon dioxide emissions.

#### The Questions
1. What variables can I use to predict a state's carbon dioxide emissions?
2. What is the best model to predict a state's carbon dioxide emissions?
3. What variables can I use to predict whether a state would be considered a "higher emissions" or "low emissions" state based on my current variables?
4. What is the best model to use to answer question 3?

#### The Data
This data set has 50 rows and 19 variables. All of the data is from 2020. For this analysis, I will not be using 'State' as a variable. I also won't be using 'CoalTrans' as a variable because it is 0 for all states. The data set I'm using is a combination of different data sets. Since there are only 50 rows of data, I will not do training & validation sets for classification models.

#### Data Sources 
* 2020 Carbon Dioxide Emissions by State: https://www.eia.gov/state/rankings/#/series/226.
* EV Tax Credit: https://www.energysage.com/electric-vehicles/costs-and-benefits-evs/ev-tax-credits/
* State Energy Consumption Estimates (1960-2020): https://www.eia.gov/state/seds/sep_use/notes/use_print.pdf 
* Transportation Sector Energy Consumption: https://www.eia.gov/state/seds/data.php?incfile=/state/seds/sep_sum/html/sum_btu_tra.html&sid=US 
* US Census 2020 Population dataset: https://www.eia.gov/state/rankings/#/series/226


### The Data
VARIABLES TO PREDICT WITH

* *TotEnergy*: total energy consumed (in trillion btu)
* *Coal*: energy consumed from coal (in trillion btu)
* *NaturalGas*: energy consumed from natural gas (in trillion btu)
* *Petroleum*: energy consumed from petroleum (in trillion btu)
* *TotFF*: total energy consumed from fossil fuels, sum of Coal, NaturalGas, & Petroleum (in trillion btu)
* *NuclearElectricPower*:  energy consumed from nuclear electric power
* *RenewableEnergy*: energy consumed from renewable energy sources (in trillion btu)
* *Residential*: energy consumed by the residential sector (in trillion btu)
* *Commercial*: energy consumed by the commercial sector (in trillion btu)
* *Industrial*: energy consumed by the industrial sector (in trillion btu)
* *Transportation*: energy consumed by the transportation sector (in trillion btu) 
* *Pop*: population of a state
* *CoalTrans*:  energy from coal used for transportation (in trillion btu)
* *NaturalGasTrans*: energy from natural gas used for transportation (in trillion btu)
* *PetroleumTrans*: energy from petroleum used for transportation (in trillion btu)
* *TaxCredit*: EV tax credit dummy variable (= 1 if tax credit; 0 otherwise)  

VARIABLES WE WANT TO PREDICT

* *TotCO2*: total CO2 emissions of a state (in million metric tons)
* *HighLow*: 1 if CO2 emissions > 100 million metric tons, 0 otherwise
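
As a minimal sketch of the coding rule for *HighLow*, the snippet below derives the indicator from emissions values with a cutoff of 100. The `toy` data frame and its values are made up for illustration and are not taken from the project data.

```r
# Hypothetical emissions values (million metric tons); not the project data
toy <- data.frame(
  State  = c("A", "B", "C", "D"),
  TotCO2 = c(5.4, 67.05, 100, 624)
)

# Code emissions strictly above 100 as 1 (high) and everything else as 0 (low)
toy$HighLow <- as.integer(toy$TotCO2 > 100)

# Share of high-emissions states in the toy data
high_share <- mean(toy$HighLow)
print(toy)
print(high_share)
```

Because the comparison is strict, a state sitting exactly at 100 is coded as low in this sketch.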

Row {data-height=750}
-----------------------------------------------------------------------

### Summary Stats
```{r,cache=TRUE}
summary(df)
```

Visualization 1 {data-icon="fa-signal"}
===================================== 

### Response Variables:
#### Total CO2 Emissions (in million metric tons)

```{r,cache=TRUE}
ggplot(df, aes(TotCO2)) +
  geom_histogram(bins = 20, fill = "cadetblue") +
  scale_x_continuous(breaks = seq(0, 700, 50)) +
  scale_y_continuous(breaks = seq(0, 15, 2)) +
  labs(x = "CO2 Emissions (million metric tons)", y = "Count")
```

Column {data-width=500}
-----------------------------------------------------------------------
This is a histogram of TotCO2, the total CO2 emissions of each state. Most states fall between 0 and 200 million metric tons of carbon dioxide.

Visualization 2 {data-icon="fa-signal"}
=====================================  

Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables: CO2 Emissions: High (1)/Low(0)
#### Bar Chart

```{r, fig.width=5, fig.height=4.5}
as_tibble(select(df,HighLow) %>%
         table()) %>%  
  ggplot(aes(y=n ,x=HighLow)) + geom_bar(stat="identity", fill="cadetblue") + scale_y_continuous(breaks=seq(0,40,5))
```


Column {data-width=500}
-----------------------------------------------------------------------
#### Box Plot
```{r, fig.width=4.5, fig.height=4.5}
ggplot(df,aes(x= HighLow, y=TotCO2, group=HighLow)) + geom_boxplot()
```

Row
-----------------------------------------------------------------------
We can see that the majority of states are considered "Low Emission" states, meaning that they produce less than 100 million metric tons of carbon dioxide. We can also see that "High Emission" states have a larger range of values and include 2 outliers.
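
The points drawn as outliers in a box plot come from the 1.5 × IQR rule that `geom_boxplot()` applies by default. The sketch below applies the same rule by hand; the `high_co2` values are invented for illustration and are not the project data.

```r
# Hypothetical emissions for "high" states (million metric tons); not the project data
high_co2 <- c(105, 110, 118, 125, 131, 140, 152, 160, 175, 190, 210, 225, 360, 624)

# First and third quartiles and the interquartile range
q   <- unname(quantile(high_co2, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Values more than 1.5 * IQR beyond the quartiles count as outliers
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- high_co2[high_co2 < lower | high_co2 > upper]
print(outliers)
```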

Scatter Plots {data-icon="fa-signal"}
=====================================  

![](FFCO2.png)

Row
-----------------------------------------------------------------------

![](CO2FF.png)



![](PopCO22.png)

![](PopCO2Info2.png)

![](TransCO22.png)

![](TransCO2Info.png)

TotCO2 Analyses {data-orientation=rows}
=======================================================================

Row
-----------------------------------------------------------------------

### Predict CO2 Emissions Models
I will be using prediction/estimation modeling techniques to predict the amount of CO2 emissions for a state. The first technique I will use is a Multiple Linear Regression. Then I will run a Decision Tree Model and compare them.
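
The two numbers reported for each regression model on this page are the adjusted R-squared and `summary(model)$sigma`, which R calls the residual standard error. The sketch below uses the built-in `mtcars` data as a stand-in for the project data. It shows how both values are obtained, and how the residual standard error differs slightly from a plain RMSE because it divides by the residual degrees of freedom rather than by the number of rows.

```r
# Stand-in example: built-in mtcars data, not the project data
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

adj_r2   <- s$adj.r.squared   # adjusted R-squared
resid_se <- s$sigma           # residual standard error

# Recompute both error measures from the residuals
res  <- residuals(fit)
rse  <- sqrt(sum(res^2) / df.residual(fit))  # divides by n - p
rmse <- sqrt(mean(res^2))                    # divides by n

print(round(c(adj_r2 = adj_r2, resid_se = resid_se, rse = rse, rmse = rmse), 3))
```

With only 50 rows and several predictors, the gap between the two error measures is small but not zero, so it helps to say which one is being reported.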

Row
-----------------------------------------------------------------------

### Predict CO2 Emissions Model 1 (M1)
For this model, my predictors were: Pop, Coal, NaturalGas, Petroleum, NuclearElectricPower, RenewableEnergy, NaturalGasTrans, PetroleumTrans, & TaxCredit.

Row
-----------------------------------------------------------------------

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
#CoalTrans is left out because it is 0 for every state
M1 <- lm(TotCO2 ~ Pop + Coal + NaturalGas + Petroleum + NuclearElectricPower + RenewableEnergy + NaturalGasTrans + PetroleumTrans + TaxCredit, data = df)
summary(M1)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(M1)
```

### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(M1)$adj.r.squared,4)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```
### RMSE

```{r, cache=TRUE}
Sig<-round(summary(M1)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```


Row
-----------------------------------------------------------------------

### Regression Output

```{r,include=FALSE, cache=TRUE}
#unsorted coefficient table for M1 (hidden by include=FALSE)
summary(M1)$coef
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(M1))[,4])  
out <- coef(summary(M1))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```


Row
-----------------------------------------------------------------------

### Analysis Summary
After examining this model, some predictors are not important in predicting CO2: NuclearElectricPower and TaxCredit are not significant (p-values of 0.512 and 0.516). A pruned version of the model is therefore created by removing predictors that are not significant.
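
One way to list the predictors that fail a significance cutoff is to filter the coefficient table by p-value. The sketch below assumes a 0.05 cutoff, which the analysis does not state explicitly, and uses the built-in `mtcars` data in place of the project data.

```r
# Stand-in example: built-in mtcars data, not the project data
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
ctab  <- coef(summary(full))
pvals <- ctab[, "Pr(>|t|)"]

# Assumed cutoff of 0.05; the intercept is kept regardless of its p-value
alpha      <- 0.05
terms_all  <- setdiff(rownames(ctab), "(Intercept)")
terms_keep <- terms_all[pvals[terms_all] < alpha]

# Refit with the predictors that passed (intercept-only if none did)
rhs    <- if (length(terms_keep) > 0) terms_keep else "1"
pruned <- lm(reformulate(rhs, response = "mpg"), data = mtcars)
print(terms_keep)
print(summary(pruned)$adj.r.squared)
```

A p-value filter is only a starting point; predictors can also be dropped or kept on subject-matter grounds.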

Row
-----------------------------------------------------------------------

### Predict CO2 Emissions Model 2 (M2)
For this analysis, we will use a pruned Multiple Linear Regression Model. I also removed predictors that I didn't think were important in predicting CO2 emissions. The transportation predictors with estimates in M1 (NaturalGasTrans & PetroleumTrans) were significant based on their p-values, so I decided to use the Transportation variable as a predictor because it's the sum of CoalTrans, NaturalGasTrans, & PetroleumTrans. These are the predictors in the final model: Coal, NaturalGas, Petroleum, & Transportation.
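
Before swapping the fuel-specific transportation columns for *Transportation*, it may be worth confirming that the total really equals their sum in the data. The sketch below shows one way to run that check; the `toy` values are invented for illustration, and the 0.05 tolerance is an arbitrary allowance for rounding.

```r
# Hypothetical values (trillion btu); not the project data
toy <- data.frame(
  CoalTrans       = c(0, 0, 0),
  NaturalGasTrans = c(5.3, 11.6, 24.9),
  PetroleumTrans  = c(169.2, 360.4, 553.3),
  Transportation  = c(174.5, 372.0, 580.1)
)

# Difference between the reported total and the sum of its parts
parts_sum <- with(toy, CoalTrans + NaturalGasTrans + PetroleumTrans)
gap       <- toy$Transportation - parts_sum

# Flag rows where the total differs from the sum by more than the tolerance
tol      <- 0.05
mismatch <- abs(gap) > tol
print(data.frame(toy, parts_sum, gap, mismatch))
```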

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
M2 <- lm(TotCO2 ~ Coal + NaturalGas + Petroleum+ Transportation, data = df)
summary(M2)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(M2)
```

### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(M2)$adj.r.squared,4)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```

### RMSE

```{r, cache=TRUE}
Sig<-round(summary(M2)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```
Row
-----------------------------------------------------------------------


### Regression Output

```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(M2)$coef, digits = 3) #pretty table output
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(M2))[,4])  
out <- coef(summary(M2))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```


Row
-----------------------------------------------------------------------

![](DecisionTree3.png)

Row
-----------------------------------------------------------------------

### TotCO2 Model Comparison
![](CO2ModelComp.png)

HighLow Analyses
=====================================  


### Predict High/Low Emissions Models
I will be using classification modeling techniques to predict if a state would be considered a "high emissions" or "low emissions" state. The first classification technique I will use is a Nominal Logistic Model. Then I will run a Boosted Tree Model and compare the two.
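
The classification models on this page are shown only as images, so the snippet below is an independent sketch rather than the code behind them. It fits a binary logistic regression with base R's `glm()` on the built-in `mtcars` data (a stand-in for the project data), classifies each row with an assumed 0.5 probability cutoff, and tabulates a confusion matrix on the same rows used for fitting, since the analysis does not hold out a validation set.

```r
# Stand-in example: built-in mtcars data, not the project data
# am is a 0/1 outcome, playing the role of HighLow
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability that the outcome equals 1
prob <- predict(logit_fit, type = "response")

# Assumed cutoff of 0.5 for turning probabilities into classes
pred_class <- as.integer(prob > 0.5)

# Confusion matrix and accuracy on the data used to fit the model
conf_mat <- table(Actual = mtcars$am, Predicted = pred_class)
accuracy <- mean(pred_class == mtcars$am)
print(conf_mat)
print(accuracy)
```

Accuracy measured on the fitting data tends to be optimistic, which is worth keeping in mind when comparing the two classifiers.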


### Nominal Logistic Model (M4)

![](Logistic3.png)

### Boosted Tree Model (M5)

![](Boosted3.png)

### HighLow Model Comparison
![](ModelComp3.png)

Conclusion
=====================================

### Questions & Answers:
1. What is the best model to predict a state's carbon dioxide emissions?

    A multiple linear regression is the best model.

2. What variables can I use to predict a state's carbon dioxide emissions?

    The predictors used were: Coal, NaturalGas, Petroleum, & Transportation.

3. What is the best model to use to predict whether a state would be considered a "higher emissions" or "low emissions" state based on my current variables?

    The best model to use is a nominal logistic model, but a boosted tree model also works very well.

4. What are the variables that can be used in the model?

    The variables used were: Coal, NaturalGas, Petroleum, Pop, & Transportation.
  
Overall, all of the models created were useful and could predict/classify a state's CO2 emissions well. The best models were the Multiple Linear Regression Model 2 and the Nominal Logistic Regression Model. The Multiple Linear Regression Model 1 was also good, but was more complicated. The Boosted Tree Model was almost the same as the Nominal Logistic Model. One thing that surprised me was that I didn't need/use all of the variables I began with.